fix: redesign attempt_number semantics#781
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
…ing (#662) - Make attempt_number 1-indexed (was 0-indexed) across all delivery paths - Manual retries derive attempt_number from logstore instead of hardcoding - Manual retry failures now interact with retry schedule (cancel + reschedule) - Budget exhaustion on manual retry cancels any lingering scheduled retry Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
be5a99d to
57b243e
Compare
Replace call-recording assertions with stateful map that mirrors real scheduler upsert/delete semantics. Tests now seed initial retry state and assert the resulting state after handler execution. Use ScheduledBackoff so each tier has a distinct delay, proving the schedule advances to the correct tier. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
57b243e to
131a85f
Compare
internal/apirouter/retry_handlers.go
Outdated
| // 3. Derive attempt number from existing attempts | ||
| attemptResp, err := h.logStore.ListAttempt(c.Request.Context(), logstore.ListAttemptRequest{ | ||
| TenantIDs: []string{event.TenantID}, | ||
| EventIDs: []string{req.EventID}, | ||
| DestinationIDs: []string{req.DestinationID}, | ||
| Limit: 1, | ||
| SortOrder: "desc", | ||
| }) | ||
| if err != nil { | ||
| AbortWithError(c, http.StatusInternalServerError, NewErrInternalServer(err)) | ||
| return | ||
| } | ||
| attemptNumber := 1 | ||
| if len(attemptResp.Data) > 0 { | ||
| attemptNumber = attemptResp.Data[0].Attempt.AttemptNumber + 1 | ||
| } |
There was a problem hiding this comment.
Although I don't have a better option, this is potentially problematic. Data in the log store is not atomic and can take time to be retrievable. It's possible to end up with duplicates (attempt number) and the performance of the query is likely not very good on CH.
Perharps that's acceptable tradeoff but I wonder if there's a better approach we haven't considered yet
There was a problem hiding this comment.
Hmm I don't see an alternative for deducing attempt_number based on available data. The only concern I have with the data is there's a 10-30s lag between the delivery and when the log data is persisted (batching). Otherwise, this query should be sufficient.
Perf is always something to consider because you can never retrieve a single item with CH but I think it's a tradeoff. Actually maybe we can use this attempt data and skip querying the Event, so it should theoretically be the same as before. It's also in the retry flow which I think is not as critical compared to the initial publish flow?
Retry executor now queries logstore for the latest attempt instead of carrying a stale attempt number in the RSMQ message. Idempotency key changed to event_id:destination_id:attempt_number so manual and auto retries with the same attempt number are deduplicated (race protection). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t number ListAttempt already returns the associated Event on each AttemptRecord. Replace the two-query pattern (RetrieveEvent + ListAttempt) with a single ListAttempt call in both the retry executor and the API retry handler. Remove RetrieveEvent from RetryEventGetter interface. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EnsureLocalStack/EnsureRabbitMQ/EnsureGCP were called before testinfra.Start(t) had a chance to skip, causing container startups in CI even with -short. Add testutil.CheckIntegrationTest(t) at the top of each test function so the skip runs first. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
attempt_numberfrom logstore (was hardcoded 0)Test plan
internal/deliverymq,internal/apirouter,internal/models)cmd/e2e)TestE2E_ManualRetryScheduleInteractionvalidates manual retry timing and scheduler logicCloses #662
🤖 Generated with Claude Code